We consider a (sub-)critical Galton-Watson process with neutral
mutations (infinite alleles model), and decompose the entire popula- 
tion into clusters of individuals carrying the same allele. We specify 
the law of this allelic partition in terms of the distribution of the num- 
ber of clone-children and the number of mutant-children of a typical 
individual. The approach combines an extension of Harris representa- 
tion of Galton-Watson processes and a version of the ballot theorem. 
Some limit theorems related to the distribution of the allelic partition 
are also given. 

1. Introduction. We consider a Galton-Watson process, that is, a popu- 
lation model with asexual reproduction such that at every generation, each 
individual gives birth to a random number of children according to a fixed 
distribution and independently of the other individuals in the population. 
We are interested in the situation where a child can be either a clone, that 
is, of the same type (or allele) as its parent, or a mutant, that is, of a new 
type. We stress that each mutant has a distinct type and in turn gives birth 
to clones of itself and to new mutants according to the same statistical law 
as its parent, even though it bears a different allele. In other words, we 
are working with an infinite alleles model where mutations are neutral for 
the population dynamics. We might as well think of a spatial population 
model in which children either occupy the same location as their parents 
or migrate to new places and start growing colonies on their own. This 
quite basic framework has been often considered in the literature (see, e.g., 
[5, 14, 23, 31, 34, 39]); we also refer to [1, 6, 7, 28, 30, 37] for interesting 
variations (these references are of course far from being exhaustive). Note
also that Galton-Watson processes with mutations can be viewed as a special
instance of multitype branching processes (see Chapter V in Athreya
and Ney [8] or Chapter 7 in Kimmel and Axelrod [26]). 

We are interested in the partition of the population into clusters of in- 
dividuals having the same allele, which will be referred to as the allelic 
partition. Statistics of the allelic partition of a random population model 
with neutral mutations were first determined in a fundamental work
of Ewens [20] for the Wright-Fisher model (more precisely this concerns 
the partition of the population at a fixed generation). Kingman [27] pro- 
vided a deep analysis of this framework, in connection with the celebrated 
coalescent process that depicts the genealogy of the Wright-Fisher model. 
We refer to [9, 10, 15, 16, 33] for some recent developments in this area 
which involve some related population models with fixed generational size 
and certain exchangeable coalescents. 

The main purpose of the present work is to describe explicitly the struc- 
ture of the allelic partition of the entire population for Galton-Watson pro- 
cesses with neutral mutations. We will always assume that the Galton- 
Watson process is critical or subcritical, so the descent of any individual 
becomes eventually extinct, and in particular the allelic clusters are finite 
a.s. We suppose that every ancestor (i.e., individual in the initial population) 
bears a different allele; it is convenient to view each ancestor as a mutant of 
the zeroth kind. We then call mutant of the first kind a mutant-child of an 
individual of the allelic cluster of an ancestor, and the set of all its clones 
(including that mutant) a cluster of the first kind. By iteration, we define 
mutants and clusters of the kth kind for any integer k>0. 

In order to describe the statistics of the allelic partition, we distinguish an 
ancestor which will then be referred to as Eve, and focus on its descent. The 
set of all individuals bearing the same allele as Eve is called the Eve cluster. 
The Eve cluster has obviously the genealogical structure of a Galton-Watson 
tree with reproduction law given by the distribution of the number of clone- 
children of a typical individual. Informally, the branching property indicates 
that the same holds for the other clusters of the allelic partition. Further, 
it should be intuitively clear that the process which counts the number of
clusters of the $k$th kind for $k \geq 0$ is again a Galton-Watson process whose
reproduction law is given by the distribution of the number of mutants of
the first kind; this phenomenon has already been pointed out in the work
of Taïb [39]. That is to say that, in some loose sense, the allelic partition
inherits branching structures from the initial Galton-Watson process. Of 
course, these formulations are only heuristic and precise statements will 
be given later on. We also stress that the forest structure which connects 
clusters of different kinds and the genealogical structure on each cluster are 
not independent since, typically, the number of mutants of the first kind
who stem from the Eve cluster is statistically related to the size of the Eve
cluster.

Our approach essentially relies on a variation of the well-known connec- 
tion due to Harris [24, 25] between ordinary Galton-Watson processes and 
sequences of i.i.d. integer-valued random variables. Specifically, we
incorporate neutral mutations in Harris' representation and, by combination
with the celebrated ballot theorem (another classical tool in this area,
expounded, e.g., by Pitman; see Chapter 6 in [36]), we obtain expressions for
the joint distribution of various natural variables (size of the total descent of 
an ancestor, number of alleles, size and number of mutant-children of an al- 
lelic cluster) in terms of the transition probabilities of the two-dimensional 
random walk which is generated by the numbers of clone-children and of 
mutant-children of a typical individual. 

We also investigate some limit theorems in law; typically we show that 
when the numbers of clone-children and mutant-children of an individual are 
independent (and some further technical conditions), the sequence of the rel- 
ative sizes of the allelic clusters in a typical tree has a limiting conditional 
distribution when the size of the tree and the number of types both tend 
to infinity according to some appropriate regime. The limiting distribution 
that arises has already appeared in the study of the standard additive
coalescent by Aldous and Pitman [6]. We also point out limit theorems for
allelic partitions of Galton-Watson forests, where, following Duquesne and
Le Gall [17, 18], the limits are described in terms of certain Lévy trees. In
particular, this provides an explanation for a rather striking identity
between two self-similar fragmentation processes that were defined on the
one hand by logging the Continuum Random Tree according to a Poisson point
process along its skeleton [6], and on the other hand by splitting the unit
interval at
instants when the standard Brownian excursion with a negative drift reaches 
new infima [11]. 

2. Allelic partitions in a Galton-Watson forest. We first develop some
material and notation about Galton-Watson forests with neutral mutations, 
referring to Chapter 6 in Pitman [36] for background in the case without 
mutations. 

2.1. Basic setting. Let

$$\xi = (\xi^{(c)}, \xi^{(m)})$$

be a pair of nonnegative integer-valued random variables which should be
thought of respectively as the number of clone-children and the number of
mutant-children of a typical individual. We also write

$$\xi^{(+)} := \xi^{(c)} + \xi^{(m)}$$

for the total number of children, and assume throughout this work that

$$\mathbb{E}(\xi^{(+)}) \leq 1,$$

that is, we work in the critical or subcritical regime. We implicitly exclude
the degenerate case when $\xi^{(c)} \equiv 0$ or $\xi^{(m)} \equiv 0$ and, as a
consequence, the means $\mathbb{E}(\xi^{(c)})$ and $\mathbb{E}(\xi^{(m)})$ are always
less than 1.

We write $\mathbb{Z}_+$ and $\mathbb{N}$ for the sets of nonnegative integers and
positive integers, respectively. A pair $(g,n) \in \mathbb{Z}_+ \times \mathbb{N}$ is
then used to identify an individual in an infinite population model, where
the first coordinate $g$ refers to the generation and the second coordinate
$n$ to the rank of the individual of that generation (we stress that each
generation consists of an infinite sequence of individuals). We assume that
each individual at generation $g + 1$ has a unique parent at generation $g$.
We consider a family

$$(\xi_{g,n} : g \in \mathbb{Z}_+ \text{ and } n \in \mathbb{N})$$

of i.i.d. copies of $\xi$ which we use to define the Galton-Watson process
with neutral mutations. Specifically,
$\xi_{g,n} = (\xi^{(c)}_{g,n}, \xi^{(m)}_{g,n})$ is the pair given by the
number of clone-children and mutant-children of the $n$th individual at
generation $g$. We may assume that the offspring of each individual is ranked,
which induces a natural order at the next generation by requiring further
that if $(g,n)$ and $(g,n')$ are two individuals at the same generation $g$ with
$n < n'$, then at generation $g + 1$ the children of $(g,n)$ are all listed before
those of $(g,n')$.

2.2. Encoding the Galton-Watson forest with mutations. Next, we enumerate
the individuals of the entire population (i.e., of all generations) by a
variation of the well-known depth-first search algorithm that takes
mutations into account. We associate to each individual a label $(a, m, s)$,
where $a \in \mathbb{N}$ is the rank of the ancestor in the initial population, $m$
the number of mutations and $s$ a finite sequence of positive integers which
keeps track of the genealogy of the individual. Specifically, the label of
the $a$th individual in the initial generation $g = 0$ is $(a, 0, \varnothing)$.
If an individual at the $g$th generation has the label
$(a, m, (i_1, \ldots, i_g))$, and if this individual has $j^{(c)}$ clone-children
and $j^{(m)}$ mutant-children, then the labels assigned to its clone-children
are

$$(a, m, (i_1, \ldots, i_g, 1)), \ldots, (a, m, (i_1, \ldots, i_g, j^{(c)})),$$

whereas the labels assigned to its mutant-children are

$$(a, m+1, (i_1, \ldots, i_g, j^{(c)}+1)), \ldots, (a, m+1, (i_1, \ldots, i_g, j^{(c)}+j^{(m)})).$$

Clearly, any two distinct individuals have different labels. We then
introduce the (random) map

$$\rho : \mathbb{N} \to \mathbb{Z}_+ \times \mathbb{N},$$

Fig. 1. Depth-first search with mutations on a genealogical tree. The node
symbols represent the different alleles. Left: the label $(m, s)$ of an
individual is given by the number $m$ of mutations and the sequence $s$ that
specifies its genealogy; for the sake of simplicity, the rank $a$ of the
ancestor has been omitted. Right: the same tree with individuals ranked
by the depth-first search algorithm with mutations.

which consists in ranking the individuals in the lexicographic order of their
labels; see Figure 1. That is to say that $\rho(i) = (g,n)$ if and only if the
$i$th individual in the lexicographic order of labels corresponds to the $n$th
individual at generation $g$. This procedure for enumerating the individuals
will be referred to as the depth-first search algorithm with mutations. We
shall also use the notation

$$\xi_i := \xi_{\rho(i)},$$

and whenever no generation is specified, the terminology $i$th individual will
implicitly refer to the rank of that individual induced by depth-first search
with mutations, that is, the $i$th individual means the $n$th individual at
generation $g$ where $\rho(i) = (g,n)$.
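The labeling rule can be tried out on a small explicit tree. The following Python sketch is only an illustration under stated assumptions: the nested-pair representation of a tree, the helper names `label_individuals` and `depth_first_search_with_mutations`, and the toy tree `eve` are all hypothetical (the tree is not read off Figure 1), and the rank $a$ of the ancestor is omitted since there is a single ancestor.

```python
# A node is a pair (clone_children, mutant_children), each a list of nodes.
# This representation and the toy tree below are illustrative assumptions.
LEAF = ([], [])


def label_individuals(node, m=0, s=()):
    """Return [(label, (clone_count, mutant_count)), ...] for a subtree.

    The label is (m, s): m counts mutations, s records the genealogy.
    """
    clones, mutants = node
    out = [((m, s), (len(clones), len(mutants)))]
    # Clone-children keep m and receive the indices 1, ..., j_c.
    for i, child in enumerate(clones, start=1):
        out += label_individuals(child, m, s + (i,))
    # Mutant-children get m + 1 and the indices j_c + 1, ..., j_c + j_m.
    for i, child in enumerate(mutants, start=len(clones) + 1):
        out += label_individuals(child, m + 1, s + (i,))
    return out


def depth_first_search_with_mutations(tree):
    """Rank individuals in the lexicographic order of their labels."""
    # Python compares tuples lexicographically, a prefix sorting first.
    return sorted(label_individuals(tree), key=lambda pair: pair[0])


eve = (
    [
        ([([], [LEAF])], []),
        ([LEAF], [([LEAF, LEAF], [])]),
    ],
    [([([], [LEAF])], [])],
)

ranked = depth_first_search_with_mutations(eve)
labels = [label for label, _ in ranked]
xi = [counts for _, counts in ranked]  # the pairs (clone count, mutant count)
```

In this toy tree the individuals with $m = 0$ (the Eve cluster) come first, then those with $m = 1$, and so on, so that each allelic cluster occupies a block of consecutive ranks.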

Lemma 1. (i) The variables $\xi_1, \xi_2, \ldots$ are i.i.d. with the same law
as $\xi$.
(ii) The sequence $(\xi_{g,n} : g \in \mathbb{Z}_+, n \in \mathbb{N})$ can be recovered
from $(\xi_i : i \in \mathbb{N})$ a.s.

Proof. It should be plain from the definition of the depth-first search
algorithm with mutations that for every $i \in \mathbb{N}$, $\rho(i + 1)$ is a
deterministic function of $\xi_1, \ldots, \xi_i$ which takes values in
$(\mathbb{Z}_+ \times \mathbb{N}) \setminus \{\rho(1), \ldots, \rho(i)\}$. Since
$(\xi_{g,n} : g \in \mathbb{Z}_+ \text{ and } n \in \mathbb{N})$ is a sequence of i.i.d.
variables with the same law as $\xi$, this yields the first claim by
induction. The second claim follows from the fact that each individual has a
finite descent a.s. [because the Galton-Watson process is (sub-)critical],
which easily entails that the map $\rho$ is bijective. Further, it is readily
seen that the inverse bijection is a function of the sequence
$(\xi_i : i \in \mathbb{N})$. □



Henceforth, we shall therefore encode the Galton-Watson process with
neutral mutations by a sequence $(\xi_i : i \in \mathbb{N})$ of i.i.d. copies of
$\xi$. We denote by $(\mathcal{F}_i)_{i \in \mathbb{N}}$ the natural filtration
generated by this sequence.

We next briefly describe the genealogy of the Galton-Watson process as
a forest of i.i.d. genealogical trees. Denote for every $n \in \mathbb{N}$ by

$$a_n := \rho^{-1}(0, n)$$

so that $a_1 = 1 < a_2 < \cdots$ is the increasing sequence of the ranks of
ancestors induced by the depth-first search algorithm with mutations. For
example, $a_2 = 13$ in the situation described by Figure 1. The procedure for
labeling individuals ensures that the descent of the $i$th ancestor $a_i$
corresponds to the integer interval

$$[a_i, a_{i+1}[ \; := \{a_i, a_i + 1, \ldots, a_{i+1} - 1\}$$

(that is to say, if we index the population model using generations, then the
descent of $(0,i)$ is the image of $[a_i, a_{i+1}[$ by the bijection $\rho$).
We write

$$\mathbb{T}_i := (\xi_{a_i - 1 + \ell} : 1 \leq \ell \leq a_{i+1} - a_i)$$

for the finite sequence of the numbers of clone-children and mutant-children
of the individuals in the descent of the $i$th ancestor. So $\mathbb{T}_i$ encodes
(by the depth-first search algorithm with mutations) the genealogical tree of
the $i$th ancestor, and it should be intuitively clear that the family
$(\mathbb{T}_i : i \in \mathbb{N})$ is a forest consisting of a sequence of i.i.d.
genealogical trees. To give a rigorous statement, it is convenient to
introduce the downward skip-free (or left-continuous) random walk

$$(1) \qquad S^{(+)}_n := \xi^{(+)}_1 + \cdots + \xi^{(+)}_n - n, \qquad n \in \mathbb{Z}_+,$$

and the passage times

$$(2) \qquad T^{(+)}_i := \inf\{n \geq 0 : S^{(+)}_n = -i\}, \qquad i \in \mathbb{Z}_+.$$

We stress that the $T^{(+)}_i$ form an increasing sequence of
$(\mathcal{F}_n)$-stopping times.



Lemma 2. There is the identity

$$a_i - 1 = T^{(+)}_{i-1}$$

for every $i \in \mathbb{N}$ and, as a consequence, the sequence
$\mathbb{T}_1, \mathbb{T}_2, \ldots$ is i.i.d.

Proof. This formula is a close relative of the classical identity of Dwass
[19] and would be well known if individuals were enumerated by the usual
depth-first search algorithm (i.e., without taking care of mutations); see,
for example, Lemma 63 in [36] or [29]. The proof in the present case is
similar. Indeed the formula is obvious for $i = 1$, and for $i = 2$, we have on
the one hand that

$$1 + \xi^{(+)}_1 + \cdots + \xi^{(+)}_{a_2 - 1} = a_2 - 1$$

by expressing the fact that the predecessor of the second ancestor found
by depth-first search with mutations has a rank given by the size of the
population generated by Eve, that is, Eve herself and her descendants. On
the other hand, we must have
$1 + \xi^{(+)}_1 + \cdots + \xi^{(+)}_n > n$ when $n < a_2 - 1$, since
otherwise the depth-first search algorithm with mutations would explore
the second ancestor before having completed the exploration of the entire
descent of Eve. This proves the identity for $i = 2$, and the general case then
follows by iteration. Finally, the last claim is an immediate consequence of
Lemma 1(i) and the strong Markov property. □
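The identity of Lemma 2 translates into a simple decoding procedure: the trees of the forest are read off from the first-passage times of the walk $S^{(+)}$. The Python sketch below is a minimal illustration, assuming a made-up sequence `xi` of pairs (clone count, mutant count); the helper name `first_passage_times` is hypothetical, and the sequence is not taken from Figure 1, although its first tree was chosen to have size 12 so that $a_2 = 13$ as quoted in the text.

```python
def first_passage_times(steps):
    """Return [T_0, T_1, ...] with T_0 = 0 and
    T_i = first n such that steps[0] + ... + steps[n-1] - n == -i."""
    times, level = [0], 0
    for n, step in enumerate(steps, start=1):
        level += step - 1  # the walk decreases by at most 1 per step
        if level == -len(times):  # new level -i reached for the first time
            times.append(n)
    return times


# Hypothetical encoding sequence: pairs (clone-children, mutant-children).
xi = [(2, 1), (1, 0), (0, 1), (1, 1), (0, 0), (0, 0),
      (2, 0), (0, 0), (0, 0), (1, 0), (0, 1), (0, 0),
      (1, 0), (0, 0)]

total_children = [c + m for c, m in xi]
T_plus = first_passage_times(total_children)

# Lemma 2: a_i - 1 = T_{i-1}, so the i-th tree occupies ranks T_{i-1}+1 .. T_i.
ancestor_ranks = [t + 1 for t in T_plus[:-1]]
trees = [xi[T_plus[i]:T_plus[i + 1]] for i in range(len(T_plus) - 1)]
```

Here the walk first hits $-1$ at time 12 and $-2$ at time 14, so the sequence splits into a tree of 12 individuals followed by a tree of 2 individuals.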

2.3. Allelic partitions. We can now turn our attention to defining allelic
partitions. In this direction, recall that every ancestor has a different type
(i.e., bears a different allele), and thus should be viewed as an initial mutant.
More generally, we call mutant an individual which either belongs to the
initial generation or is the mutant-child of some individual, and then write

$$1 = \mu_1 < \mu_2 < \cdots$$

for the ranks of mutants in the depth-first search algorithm with mutations.
For example, $\mu_2 = 6$, $\mu_3 = 7$, $\mu_4 = 10$, $\mu_5 = 12$ and
$\mu_6 = a_2 = 13$ in the situation depicted by Figure 1. The upshot of this
algorithm is that the set of individuals that bear the same allele as the
$j$th mutant $\mu_j$ corresponds precisely to the integer interval
$[\mu_j, \mu_{j+1}[$. In this direction, it is therefore natural to introduce
for every $j \in \mathbb{N}$ the $j$th allelic cluster

$$\mathcal{C}_j := (\xi_{\mu_j - 1 + \ell} : 1 \leq \ell \leq \mu_{j+1} - \mu_j),$$

that is, $\mathcal{C}_j$ is the finite sequence of the numbers of clone-children
and mutant-children of the individuals bearing the same allele as the $j$th
mutant. The sequence $(\mathcal{C}_j)_{j \in \mathbb{N}}$ encodes the allelic partition
of the entire population.

Remarks. 1. Each allelic cluster $\mathcal{C}_j$ is naturally endowed with a
structure of rooted planar tree which is induced by the Galton-Watson
process. More precisely, the latter is encoded via the usual depth-first
search algorithm by the sequence
$(\xi^{(c)}_{\mu_j - 1 + \ell} : 1 \leq \ell \leq \mu_{j+1} - \mu_j)$; in
particular the $j$th mutant $\mu_j$ is viewed as the root (i.e., ancestor) of
the cluster $\mathcal{C}_j$. In other words, the depth-first search algorithm with
mutations for the Galton-Watson process induces precisely the usual
depth-first search applied to the forest of allelic clusters viewed as a
sequence of planar rooted trees.

2. We also stress that the initial Galton-Watson process can be recovered
from the allelic partition $(\mathcal{C}_j)_{j \in \mathbb{N}}$. Indeed, the previous
observation shows how to construct the portion of the genealogical tree
corresponding to the allelic cluster generated by an initial mutant, and the
latter also contains the information which is needed to identify the
mutant-children of the first kind. Mutant-children of the first kind are the
roots of the subtrees corresponding to the allelic clusters of the second
kind, and by iteration the entire genealogical forest can be recovered.

Just as above, it is now convenient to introduce the downward skip-free
random walk

$$(3) \qquad S^{(c)}_n := \xi^{(c)}_1 + \cdots + \xi^{(c)}_n - n, \qquad n \in \mathbb{Z}_+,$$

and the passage times

$$(4) \qquad T^{(c)}_j := \inf\{n \geq 0 : S^{(c)}_n = -j\}, \qquad j \in \mathbb{Z}_+.$$

Again, the $T^{(c)}_j$ form an increasing sequence of $(\mathcal{F}_n)$-stopping
times.

Lemma 3. There is the identity

$$\mu_j - 1 = T^{(c)}_{j-1}$$

for every $j \in \mathbb{N}$. As a consequence, for every $j \in \mathbb{N}$,
$\mathcal{C}_j$ is adapted to the sigma-field $\mathcal{F}_{T^{(c)}_j}$, whereas
$\mathcal{C}_{j+1}$ is independent of $\mathcal{F}_{T^{(c)}_j}$ and has the same
distribution as $\mathcal{C}_1$. In particular the sequence of the allelic
clusters $\mathcal{C}_1, \mathcal{C}_2, \ldots$ is i.i.d.

The proof is similar to that of Lemma 2 and therefore omitted.
We also introduce the number of alleles, that is, of different types, which
are present in the $i$th tree $\mathbb{T}_i$:

$$A_i := \mathrm{Card}\{j \in \mathbb{N} : \mu_j \in [a_i, a_{i+1}[\};$$

for example, $A_1 = 5$ in the situation described by Figure 1. Note that there
is the alternative expression

$$A_i = 1 + \sum_{a_i \leq \ell < a_{i+1}} \xi^{(m)}_\ell.$$

Corollary 1. (i) For every $i \in \mathbb{Z}_+$, we have

$$a_{i+1} = \mu_{A_1 + \cdots + A_i + 1};$$

equivalently, there is the identity

$$T^{(+)}_i = T^{(c)}_{A_1 + \cdots + A_i}.$$

(ii) The allelic partition of the tree $\mathbb{T}_i$, which is induced by
restricting the allelic partition of the entire population to $\mathbb{T}_i$, is
given by

$$(\mathcal{C}_{A_1 + \cdots + A_{i-1} + \ell} : 1 \leq \ell \leq A_i).$$

As a consequence, the sequence of the allelic partitions of the trees
$\mathbb{T}_i$ for $i \in \mathbb{N}$ is i.i.d.

Proof. (i) The first identity should be obvious from the definition of
the depth-first search with mutations, as $A_1 + \cdots + A_i$ is the number of
alleles which have been found after completing the exploration of the first
$i$ trees and the next mutant is then the $(i + 1)$th ancestor. The second then
follows from Lemmas 2 and 3.

(ii) The first assertion is immediately seen from (i) and the definitions
of the trees and of the allelic clusters. Then observe that the number $A_i$ of
alleles in the tree $\mathbb{T}_i$ is a function of that tree, and so is the allelic
partition. The second assertion thus derives from Lemma 2. □
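Lemma 3 and Corollary 1 can be checked numerically on a concrete encoding sequence. The Python sketch below is an illustration only: the sequence `xi` is a made-up example (not read off Figure 1) chosen so that its first tree reproduces the values $\mu_2 = 6$, $\mu_3 = 7$, $\mu_4 = 10$, $\mu_5 = 12$, $a_2 = 13$ and $A_1 = 5$ quoted in the text, and the helper `first_passage_times` is a hypothetical name.

```python
from itertools import accumulate


def first_passage_times(steps):
    """Return [T_0, T_1, ...] for the skip-free walk n -> sum(steps[:n]) - n."""
    times, level = [0], 0
    for n, step in enumerate(steps, start=1):
        level += step - 1
        if level == -len(times):
            times.append(n)
    return times


# Hypothetical pairs (clone-children, mutant-children) in depth-first order
# with mutations; two trees, of sizes 12 and 2.
xi = [(2, 1), (1, 0), (0, 1), (1, 1), (0, 0), (0, 0),
      (2, 0), (0, 0), (0, 0), (1, 0), (0, 1), (0, 0),
      (1, 0), (0, 0)]

T_plus = first_passage_times([c + m for c, m in xi])  # tree boundaries
T_clone = first_passage_times([c for c, _ in xi])     # cluster boundaries

# Lemma 3: mu_j - 1 = T^{(c)}_{j-1}.
mutant_ranks = [t + 1 for t in T_clone[:-1]]
clusters = [xi[T_clone[j]:T_clone[j + 1]] for j in range(len(T_clone) - 1)]

# A_i = number of mutants whose rank falls in the i-th tree.
alleles_per_tree = [
    sum(1 for mu in mutant_ranks if T_plus[i] < mu <= T_plus[i + 1])
    for i in range(len(T_plus) - 1)
]

# Corollary 1(i): T^{(+)}_i = T^{(c)}_{A_1 + ... + A_i}.
cumulative_alleles = list(accumulate(alleles_per_tree))
corollary_holds = all(
    T_plus[i + 1] == T_clone[cumulative_alleles[i]]
    for i in range(len(alleles_per_tree))
)
```

The alternative expression for $A_i$ can be read off the same data: the first tree contains 4 mutant-children in total, hence $1 + 4 = 5$ alleles.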

It may be interesting to point out that $(T^{(+)}_i, i \geq 0)$ and
$(T^{(c)}_j, j \geq 0)$ are both increasing random walks. The range

$$\mathcal{R}^{(+)} := \{T^{(+)}_i : i \geq 0\}$$

is the set of predecessors of ancestors (in the depth-first search algorithm
with mutations), whereas

$$\mathcal{R}^{(c)} := \{T^{(c)}_j : j \geq 0\}$$

corresponds to predecessors of mutants. These are two regenerative subsets
of $\mathbb{Z}_+$ in the sense that each can be viewed as the set of renewal
epochs of some recurrent event (cf. Feller [21, 22]). Observe that both yield
a partition of the set of positive integers into disjoint intervals:

$$\mathbb{N} = \bigcup_{i \in \mathbb{N}} \, ]T^{(+)}_{i-1}, T^{(+)}_i] = \bigcup_{j \in \mathbb{N}} \, ]T^{(c)}_{j-1}, T^{(c)}_j]$$

that correspond respectively to the trees in the Galton-Watson forest and
to the allelic clusters. By Corollary 1(i), there is the embedding

$$\mathcal{R}^{(+)} \subseteq \mathcal{R}^{(c)}$$

and more precisely, this embedding is compatible with regeneration, in the
sense that for every integer $k \geq 0$, conditionally on
$k \in \mathcal{R}^{(+)}$, the shifted sets
$\mathcal{R}^{(+)} \circ \theta_k := \{i \geq 0 : k + i \in \mathcal{R}^{(+)}\}$ and
$\mathcal{R}^{(c)} \circ \theta_k := \{j \geq 0 : k + j \in \mathcal{R}^{(c)}\}$ are
independent of the sigma-field $\mathcal{F}_k$ generated by
$(\xi_1, \ldots, \xi_k)$ and their joint law is the same as that of
$(\mathcal{R}^{(+)}, \mathcal{R}^{(c)})$. We refer to [11] for applications of
this notion. Roughly speaking, this implies that the allelic split of each
interval $]T^{(+)}_{i-1}, T^{(+)}_i]$ produces smaller intervals
$]T^{(c)}_{j-1}, T^{(c)}_j]$ in a random way that only depends on the length
$T^{(+)}_i - T^{(+)}_{i-1}$ (i.e., the size of $\mathbb{T}_i$), independently
of its location and of the other integer intervals. This can be thought of as
a fragmentation property (see [13]) for the sizes of the trees.

2.4. Allelic trees and forest. In order to analyze the structure of allelic
partitions, we introduce some related notions. The genealogy of the
population model naturally induces a structure of forest on the set of
different alleles. More precisely, we enumerate this set by declaring that the
$j$th allele is that of the $j$th cluster $\mathcal{C}_j$, and define a planar
graph on the set of alleles (which is thus identified as $\mathbb{N}$) by drawing
an edge between two integers $j < k$ if and only if the parent of the $k$th
mutant $\mu_k$ is an individual of the $j$th allelic cluster $\mathcal{C}_j$. This
graph is clearly a forest (i.e., it contains no cycles), which we call the
allelic forest, and more precisely the $i$th allelic tree is that induced by
the mutant descent of the $i$th ancestor $a_i$. In other words, the $i$th
allelic tree is the genealogical tree of the different alleles present in
$\mathbb{T}_i$. In particular, the sequence of allelic trees is i.i.d. and their
sizes are given by $(A_i, i \in \mathbb{N})$.

Recall that the breadth-first search in a forest consists in enumerating
individuals in the lexicographic order of their labels, where the label of the
$n$th individual at generation $g$ is now given by the triplet $(a,g,n)$, with
$a$ the rank of the ancestor at the initial generation. After a (short) moment
of thought, we see that the definition of depth-first search with mutations
for the Galton-Watson process ensures that the labeling of alleles by integers
agrees with breadth-first search on the allelic forest, in the sense that the
$j$th allele is found at the $j$th step of the breadth-first search on the
allelic forest.

For every $j \in \mathbb{N}$, we consider the number of new mutants who are
generated by the $j$th allelic cluster, viz.

$$M_j := \sum_{\mu_j \leq \ell < \mu_{j+1}} \xi^{(m)}_\ell.$$

For instance, we have $M_1 = 3$, $M_4 = 1$ and $M_2 = M_3 = M_5 = 0$ in the
situation depicted by Figures 1 and 2. The allelic forest is thus encoded by
breadth-first search via the sequence $(M_j, j \in \mathbb{N})$.

Lemma 4. The sequence $(M_j, j \in \mathbb{N})$ is i.i.d., and therefore the
allelic forest is a Galton-Watson forest with reproduction law the
distribution of $M_1$. As a consequence, the size $A_1$ of the first allelic
tree is given by the identity

$$A_1 = \min\{j \geq 1 : M_1 + \cdots + M_j = j - 1\},$$

showing that $A_1$ is an $(\mathcal{F}_{T^{(c)}_j})$-stopping time.

Proof. Recall from Lemma 3 that the sequence
$\mathcal{C}_1, \mathcal{C}_2, \ldots$ of the allelic clusters is i.i.d. Clearly,
each variable $M_j$ only depends on $\mathcal{C}_j$, which entails our first
claim. The second follows from the well-known fact that breadth-first search
induces a bijective transformation between the distributions of
(sub-)critical Galton-Watson forests and those of i.i.d. sequences of
integer-valued variables with mean less than or equal to one (see, e.g.,
Section 6.2 in [36]).

Finally, the identity for the number $A_1$ of alleles present in the tree
$\mathbb{T}_1$ follows from the preceding observations and again a variation of
the celebrated formula of Dwass [19] (see, e.g., Lemma 2 in the present work),
as plainly, $A_1$ coincides with the total size of the first tree in the
allelic forest.
□
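The breadth-first encoding of the allelic forest by $(M_j)$ can be decoded mechanically. The Python sketch below is a minimal illustration, assuming that the input list describes complete allelic trees; the function names are hypothetical. The first five entries of `M` are the values $M_1, \ldots, M_5$ quoted above for Figures 1 and 2, and the sixth entry is an added assumption standing for an allele that starts a second tree.

```python
from collections import deque


def decode_allelic_forest(M):
    """Decode a breadth-first sequence M_1, M_2, ... into a parent map.

    Alleles are numbered 1..len(M); roots get parent None. Assumes that M
    describes complete trees (no tree is cut off at the end of the list).
    """
    parent, tree_sizes = {}, []
    next_allele = 1
    while next_allele <= len(M):
        root = next_allele
        parent[root] = None
        queue = deque([root])
        next_allele += 1
        size = 0
        while queue:
            j = queue.popleft()
            size += 1
            for _ in range(M[j - 1]):  # the M_j mutant-children of cluster j
                parent[next_allele] = j
                queue.append(next_allele)
                next_allele += 1
        tree_sizes.append(size)
    return parent, tree_sizes


def first_tree_size(M):
    """A_1 = min{j >= 1 : M_1 + ... + M_j = j - 1}, as in Lemma 4."""
    total = 0
    for j, m_j in enumerate(M, start=1):
        total += m_j
        if total == j - 1:
            return j
    return None  # the first tree is not complete within M


M = [3, 0, 0, 1, 0, 0]
parent, tree_sizes = decode_allelic_forest(M)
A_1 = first_tree_size(M)
```

With these values the first allelic tree has root 1 with children 2, 3, 4, and allele 5 is the child of allele 4, so both computations give $A_1 = 5$.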

3. Some applications of the ballot theorem. We start by stating a version
of the classical ballot theorem that will be used in this section; see [40]. Let
$(X_1, \ldots, X_n)$ be an $n$-tuple of random variables with values in some
space $E$, which is cyclically exchangeable, in the sense that for every
$i \in \mathbb{N}$, there is the identity in law

$$(X_1, \ldots, X_n) \overset{(d)}{=} (X_{i+1}, \ldots, X_{i+n}),$$

where we agree that addition of indices is taken modulo $n$. Consider a
function

$$f : E \to \{-1, 0, 1, 2, \ldots\}$$

and assume that

$$\sum_{i=1}^{n} f(X_i) = -k$$

for some $1 \leq k \leq n$.

Fig. 2. Allelic tree corresponding to the genealogical tree with mutations in
Figure 1. The labels represent the sizes of the allelic clusters.

Lemma 5 (Ballot theorem). Under the assumptions above, the probability that
the process of the partial sums of the sequence $f(X_1), \ldots, f(X_n)$
remains above $-k$ until the $n$th step is

$$k/n.$$
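The combinatorial fact behind this statement is that, for a fixed sequence with values in $\{-1, 0, 1, \ldots\}$ and sum $-k$, exactly $k$ of its $n$ cyclic shifts have partial sums that stay above $-k$ before the last step. The Python sketch below checks this by brute force on two made-up sequences; the function names are hypothetical and the check is an illustration, not a proof.

```python
def stays_above_until_end(values, k):
    """True if the partial sums are > -k before the last step and end at -k."""
    total = 0
    for value in values[:-1]:
        total += value
        if total <= -k:
            return False
    return total + values[-1] == -k


def count_good_rotations(values):
    """Number of cyclic shifts of `values` that stay above -k until step n,
    where k = -sum(values)."""
    n, k = len(values), -sum(values)
    return sum(
        stays_above_until_end(values[i:] + values[:i], k) for i in range(n)
    )


# Entries are >= -1; the first list sums to -4, the second to -1.
example = [2, -1, -1, 0, -1, -1, 1, -1, -1, -1]
small = [1, -1, -1]
good_example = count_good_rotations(example)
good_small = count_good_rotations(small)
```

Under cyclic exchangeability each shift has the same conditional law, which is how a count of $k$ good shifts out of $n$ turns into the probability $k/n$.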




3.1. Distribution of the allelic tree. We have now introduced all the tools
which are needed for describing some statistics of the allelic partition of a
Galton-Watson tree with neutral mutations. We only need one more piece of
notation. We write

$$(5) \qquad \pi_{k,\ell} = \mathbb{P}(\xi^{(c)} = k, \xi^{(m)} = \ell), \qquad k, \ell \in \mathbb{Z}_+,$$

for the probability function of the reproduction law of the Galton-Watson
process with mutations. For every integer $n \geq 1$, we also write $\pi^{*n}$
for the $n$th convolution product of that law, that is,

$$\pi^{*n}_{k,\ell} = \mathbb{P}(\xi^{(c)}_1 + \cdots + \xi^{(c)}_n = k, \; \xi^{(m)}_1 + \cdots + \xi^{(m)}_n = \ell).$$

Example. Suppose that the dynamics of the population can be described as
follows. We start from a usual Galton-Watson process with reproduction law
on $\mathbb{Z}_+$, say $\varrho$, and assume that at each step mutations affect
each child with probability $p \in \, ]0, 1[$, independently of the other
children. In other words, the allelic forest is obtained by pruning or
percolation on the genealogical forest of the Galton-Watson process, cutting
each edge with probability $p$ and independently of the other edges. See, for
example, Aldous and Pitman [5] or Chapter 4 in Lyons and Peres [31].
Analytically, this means that if $\xi^{(+)}$ is a random variable with law
$\varrho$, then the conditional distribution of $(\xi^{(c)}, \xi^{(m)})$ given
$\xi^{(+)} = k$ is that of $(k - B(k,p), B(k,p))$, where $B(k,p)$ denotes a
binomial variable with parameters $k$ and $p$. In this situation, it is easily
seen that

$$(6) \qquad \pi^{*n}_{k,\ell} = \binom{k+\ell}{k} (1-p)^k p^\ell \, \varrho^{*n}_{k+\ell},$$

with $\varrho^{*n}$ denoting the $n$th convolution power of $\varrho$. This
expression is entirely explicit when $\varrho$ is, for example, the Poisson,
binomial or geometric distribution, as in those cases there are known
formulas for $\varrho^{*n}$. Of course, there are other natural examples in
which the two-dimensional probability function $\pi^{*n}$ can be expressed in
terms of simpler one-dimensional probability functions, for instance, when
$\xi^{(c)}$ and $\xi^{(m)}$ are assumed to be independent or when
$\xi^{(c)} = \beta \xi^{(+)}$ and $\xi^{(m)} = (1 - \beta) \xi^{(+)}$, where
$\beta$ stands for a Bernoulli variable which is independent of $\xi^{(+)}$.
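Formula (6) can be evaluated directly when the reproduction law is Poisson, since the $n$th convolution power of a Poisson law with mean $\lambda$ is Poisson with mean $n\lambda$. The Python sketch below is an illustration under that assumption; the function names and the numerical parameters are hypothetical. As a sanity check it compares (6) with the classical Poisson splitting property, by which the numbers of clones and of mutants are then independent Poisson variables with means $n\lambda(1-p)$ and $n\lambda p$.

```python
from math import comb, exp, factorial, isclose


def poisson_pmf(mean, j):
    """Probability that a Poisson variable with the given mean equals j."""
    return exp(-mean) * mean ** j / factorial(j)


def pi_conv(n, k, l, lam, p):
    """Formula (6) for a Poisson(lam) reproduction law thinned with
    mutation probability p: the n-fold convolution evaluated at (k, l)."""
    return comb(k + l, k) * (1 - p) ** k * p ** l * poisson_pmf(n * lam, k + l)


def pi_conv_by_splitting(n, k, l, lam, p):
    """Same quantity via Poisson splitting (independent clone/mutant counts)."""
    return poisson_pmf(n * lam * (1 - p), k) * poisson_pmf(n * lam * p, l)


lam, p = 0.9, 0.2  # illustrative values only (subcritical since lam < 1)
value = pi_conv(3, 2, 1, lam, p)
reference = pi_conv_by_splitting(3, 2, 1, lam, p)
```

The two expressions agree up to floating-point error for any choice of $n$, $k$, $\ell$, which can be confirmed with `math.isclose`.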

Corollary 1 enables us to restrict our attention to the allelic partition of
the tree generated by a typical ancestor, say for simplicity, Eve. Recall that
$T^{(+)}_1$ denotes the size of the genealogical tree $\mathbb{T}_1$ of Eve, that
$A_1$ is the number of alleles found in $\mathbb{T}_1$ and that the $j$th allelic
cluster $\mathcal{C}_j$ generates $M_j$ mutant-children. Further, we know from
Lemma 4 that the first allelic tree is encoded by breadth-first search via the
finite sequence $(M_j, 1 \leq j \leq A_1)$. The latter only retains partial
information about the structure of the allelic partition of $\mathbb{T}_1$, and
thus it is natural to enrich it by considering more generally the sequence of
pairs $((|\mathcal{C}_j|, M_j), 1 \leq j \leq A_1)$, where

$$|\mathcal{C}_j| := \mu_{j+1} - \mu_j$$

denotes the size of the $j$th allelic cluster, that is, the number of
individuals having the $j$th type. In other words, we enrich the allelic tree
by assigning to each allele the size of the corresponding allelic cluster. We
may now state our main result, which can be viewed as a generalization of a
celebrated identity due to Dwass [19].

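Theorem 1. (i) The joint law of the size of $\mathbb{T}_1$ and its number of
alleles is given by

$$\mathbb{P}(T^{(+)}_1 = n, A_1 = k) = \frac{1}{n} \pi^{*n}_{n-k,k-1}, \qquad 1 \leq k \leq n.$$

(ii) The joint law of the size of the Eve cluster and the number of its
mutant-children is given by

$$\mathbb{P}(|\mathcal{C}_1| = n, M_1 = \ell) = \frac{1}{n} \pi^{*n}_{n-1,\ell}, \qquad n \geq 1 \text{ and } \ell \geq 0.$$

(iii) For all integers $k \geq 1$, $n_1, \ldots, n_k \geq 1$ and
$\ell_1, \ldots, \ell_k \geq 0$ such that

$$\sum_{i=1}^{j} \ell_i > j - 1 \quad \text{whenever } 1 \leq j < k,$$

we have

$$\mathbb{P}(|\mathcal{C}_1| = n_1, M_1 = \ell_1, \ldots, |\mathcal{C}_k| = n_k, M_k = \ell_k) = \prod_{i=1}^{k} \frac{1}{n_i} \pi^{*n_i}_{n_i-1,\ell_i}.$$

Remarks. 1. Restricting our attention in part (iii) to sequences
$\ell_1, \ldots, \ell_k \geq 0$ with

$$\inf\Big\{j \geq 1 : \sum_{i=1}^{j} \ell_i = j - 1\Big\} = k,$$

we stress that the statement describes the law of the entire allelic tree.
2. In particular, the law of the number $A_1$ of alleles is given by

$$\mathbb{P}(A_1 = k) = \sum_{n=k}^{\infty} \frac{1}{n} \pi^{*n}_{n-k,k-1}.$$

It may be interesting to point out that there is also the formula

$$\mathbb{P}(A_1 = k) = \frac{1}{k} \nu^{*k}_{k-1},$$

where

$$\nu_\ell := \sum_{n=1}^{\infty} \frac{1}{n} \pi^{*n}_{n-1,\ell}$$

and $\nu^{*k}$ is the $k$th convolution power of $\nu$. Indeed, this alternative
formulation is seen from Lemma 4 and Dwass' formula [19].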
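Part (i) of Theorem 1 is easy to evaluate in the Poisson setting of the Example. The Python sketch below is an illustration only, with hypothetical function names and parameter values: it computes the joint law of the tree size and the number of alleles from formula (6), and checks that summing over the number of alleles recovers the classical Dwass (Borel) formula for the total size of a Poisson Galton-Watson tree, namely $\frac{1}{n}$ times the Poisson probability with mean $n\lambda$ at $n - 1$.

```python
from math import comb, exp, factorial, isclose


def poisson_pmf(mean, j):
    """Probability that a Poisson variable with the given mean equals j."""
    return exp(-mean) * mean ** j / factorial(j)


def pi_conv(n, k, l, lam, p):
    """Formula (6) for a Poisson(lam) reproduction law with mutation rate p."""
    return comb(k + l, k) * (1 - p) ** k * p ** l * poisson_pmf(n * lam, k + l)


def joint_size_alleles(n, k, lam, p):
    """P(tree size = n, number of alleles = k) following Theorem 1(i)."""
    return pi_conv(n, n - k, k - 1, lam, p) / n


def tree_size_pmf(n, lam):
    """Dwass formula for the total size of a Poisson(lam) Galton-Watson tree."""
    return poisson_pmf(n * lam, n - 1) / n


lam, p = 0.9, 0.2  # illustrative values only
marginals = [
    sum(joint_size_alleles(n, k, lam, p) for k in range(1, n + 1))
    for n in range(1, 9)
]
```

The agreement of `marginals` with `tree_size_pmf` reflects the binomial theorem: given that the $n$ individuals have $n - 1$ children in total, the number of mutants among them is binomial.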

Proof. Recall that $(\xi^{(c)}_1, \xi^{(m)}_1), \ldots, (\xi^{(c)}_n, \xi^{(m)}_n)$
is a sequence of $n$ i.i.d. copies of $(\xi^{(c)}, \xi^{(m)})$ and consider the
partial sums of coordinates

$$\Sigma^{(c)}_j = \sum_{i=1}^{j} \xi^{(c)}_i, \qquad \Sigma^{(m)}_j = \sum_{i=1}^{j} \xi^{(m)}_i \qquad \text{and} \qquad \Sigma_j = \Sigma^{(c)}_j + \Sigma^{(m)}_j.$$

Introduce for every $1 \leq k \leq n$ the event

$$\Lambda_{n-k,k-1} = \{\Sigma^{(c)}_n = n - k, \Sigma^{(m)}_n = k - 1\} = \{\Sigma_n = n - 1, \Sigma^{(m)}_n = k - 1\}$$

and observe that the sequence
$(\xi^{(c)}_1, \xi^{(m)}_1), \ldots, (\xi^{(c)}_n, \xi^{(m)}_n)$ is (cyclically)
exchangeable conditionally on $\Lambda_{n-k,k-1}$. Further, we have by
definition that

$$\mathbb{P}(\Lambda_{n-k,k-1}) = \pi^{*n}_{n-k,k-1}.$$

Plainly, there is the identity

$$\{T^{(+)}_1 = n, A_1 = k\} = \Lambda_{n-k,k-1} \cap \{\min\{j \geq 1 : \Sigma_j = j - 1\} = n\}$$

as, according to Lemma 2,

$$\min\{j \geq 1 : \Sigma_j = j - 1\} = \min\{j \geq 1 : S^{(+)}_j = -1\} = T^{(+)}_1.$$

By the ballot theorem [take $f(x^{(c)}, x^{(m)}) = x^{(c)} + x^{(m)} - 1$ in
Lemma 5], we have

$$\mathbb{P}(\min\{j \geq 1 : \Sigma_j = j - 1\} = n \mid \Lambda_{n-k,k-1}) = 1/n,$$

which yields (i).

The proof of (ii) is similar, observing that

$$\{|\mathcal{C}_1| = n, M_1 = \ell\} = \Lambda_{n-1,\ell} \cap \{\min\{j \geq 1 : \Sigma^{(c)}_j = j - 1\} = n\}.$$

Finally (iii) follows by iteration from (ii) and the fact that conditionally on
$A_1 \geq j + 1$, the $(j + 1)$th allelic cluster $\mathcal{C}_{j+1}$ is independent
of $(\mathcal{C}_k, 1 \leq k \leq j)$ and has the same distribution as the Eve
cluster $\mathcal{C}_1$ (see Lemma 3). □

3.2. Conditioning on the population size and the number of alleles. In
the rest of this section, we will be interested in the relative sizes of clusters
in the allelic partition of the first tree $\mathbb{T}_1$, ignoring their
connections. We start with a description which is essentially a variation of
that in Theorem 1(iii). Recall that a random uniform cyclic permutation of
$\{1, \ldots, k\}$, say $\sigma$, is given by $\sigma(i) = U + i$ where $U$ is
uniform on $\{1, \ldots, k\}$ and the addition is taken modulo $k$.

Corollary 2. Fix $1 \leq k \leq n$ and let $\sigma$ be a random uniform cyclic
permutation of $\{1, \ldots, k\}$ which is independent of the Galton-Watson
process. Then for every collection of positive integers $n_1, \ldots, n_k$ with
$n_1 + \cdots + n_k = n$, we have

$$\mathbb{P}(|\mathcal{C}_{\sigma(1)}| = n_1, \ldots, |\mathcal{C}_{\sigma(k)}| = n_k \mid T^{(+)}_1 = n, A_1 = k) = \frac{n}{k \, \pi^{*n}_{n-k,k-1}} \sum \prod_{i=1}^{k} \frac{1}{n_i} \pi^{*n_i}_{n_i-1,\ell_i},$$

where in the right-hand side, the sum is taken over the sequences
$\ell_1, \ldots, \ell_k$ in $\mathbb{Z}_+$ such that
$\ell_1 + \cdots + \ell_k = k - 1$.

Proof. A classical application of the ballot theorem shows that the
conditional distribution of
$(\mathcal{C}_{\sigma(1)}, \ldots, \mathcal{C}_{\sigma(k)})$ given $T^{(+)}_1 = n$ and
$A_1 = k$ is the same as that of $(\mathcal{C}_1, \ldots, \mathcal{C}_k)$
conditioned on $\sum_{i=1}^{k} |\mathcal{C}_i| = n$ and
$\sum_{i=1}^{k} M_i = k - 1$. Then note that

$$\sum_{i=1}^{k} |\mathcal{C}_i| = n \text{ and } \sum_{i=1}^{k} M_i = k - 1 \iff T^{(c)}_k = n \text{ and } \sum_{i=1}^{n} \xi^{(m)}_i = k - 1$$

and an application of the ballot theorem (much in the same way as in the
proof of Theorem 1) shows that the probability of that event equals

$$\frac{k}{n} \pi^{*n}_{n-k,k-1}.$$

Theorem 1(ii) completes the proof. □

Next, we normalize the size $|C_i|$ of each cluster by the size $T_1^{(+)}$ of the
total population (recall we focus on the descent of a single ancestor, namely
Eve), and write

$$\Gamma_1 \geq \Gamma_2 \geq \cdots \geq \Gamma_{A_1}$$

for the sequence which is obtained by ranking the ratios $|C_i|/T_1^{(+)}$ in
decreasing order. So $\Gamma = (\Gamma_1,\ldots,\Gamma_{A_1})$ is a proper partition of the unit
mass, in the sense that it is given by a ranked sequence of positive real numbers
with sum 1. The space of mass partitions (possibly with infinitely many
strictly positive terms and sum less than 1) is endowed with the supremum
distance, which yields a compact metric space; see Section 2.1 in [13] for
details.
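As a small illustration of the ranked mass partition just defined, the following Python sketch turns a list of cluster sizes into the ranked ratios and computes the supremum distance between two ranked sequences. The function names and the sample sizes are hypothetical; exact fractions are used only so that the sum is exactly 1.

```python
from fractions import Fraction

def ranked_mass_partition(cluster_sizes):
    """Ratios |C_i| / (total population), ranked in decreasing order."""
    total = sum(cluster_sizes)
    return sorted((Fraction(s, total) for s in cluster_sizes), reverse=True)

def sup_distance(p, q):
    """Supremum distance between two ranked sequences, padded with zeros."""
    m = max(len(p), len(q))
    p = list(p) + [0] * (m - len(p))
    q = list(q) + [0] * (m - len(q))
    return max((abs(a - b) for a, b in zip(p, q)), default=0)

gamma = ranked_mass_partition([2, 5, 3])    # illustrative cluster sizes
other = ranked_mass_partition([4, 4, 1, 1])
gap = sup_distance(gamma, other)
```

Padding with zeros reflects the convention that a finite partition is viewed as an infinite ranked sequence with only finitely many positive terms.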

Our purpose now is to investigate the asymptotic behavior of the random
mass partition $\Gamma$, under the conditional probability given the size
$T_1^{(+)} = n$ of the tree $T_1$ and the number $A_1 = k$ of alleles, when
$n, k \to \infty$. We shall show that, under appropriate hypotheses, one can
establish convergence in distribution, where the limit can be described as
follows. For some fixed parameter $b > 0$, consider the sequence
$a_1 > a_2 > \cdots > 0$ of the atoms, ranked in decreasing order, of a Poisson
point measure on $]0,\infty[$ with intensity $b\,a^{-3/2}\,da$. Roughly speaking, we
then get a random proper mass-partition by conditioning on
$\sum_{i=1}^{\infty} a_i = 1$; see, for example, [35] or Proposition 2.4 in
[13] for a rigorous definition of this conditioning by a singular event.

This family of random mass-partitions has appeared previously in a
remarkable work by Aldous and Pitman [6]; more precisely, it arose by logging
the Continuum Random Tree according to Poissonian cuts along its skeleton;
see also [3, 7, 12, 32] for related works. In the present setting, we may
interpret such cuts as mutations which induce an allelic partition. As we
know from Aldous [4] that the Continuum Random Tree can be viewed as
the limit when $n \to \infty$ of Galton-Watson trees conditioned to have total size
$n$, the fact that the preceding random mass-partitions appear again in the
framework of this work should not come as a surprise.

For the sake of simplicity, we shall focus on the case when the number
of clone-children $\xi^{(c)}$ and the number of mutant-children $\xi^{(m)}$ are
independent, although it seems likely that our argument should also apply to more
general situations. Recall that the expected number of clone-children of a
typical individual is $E(\xi^{(c)}) < 1$. We shall work under the hypothesis that by
a suitable exponential tilting, this subcritical random variable can be turned
into a critical one with finite variance. That is, we shall assume that there
exists a real number $\theta > 1$ such that

$$E(\xi^{(c)}\theta^{\xi^{(c)}}) = E(\theta^{\xi^{(c)}}) \quad\text{and}\quad \sigma^2 := E((\xi^{(c)})^2\theta^{\xi^{(c)}})/E(\theta^{\xi^{(c)}}) - 1 < \infty. \tag{7}$$

It can be readily checked that (7) then specifies $\theta$ uniquely.
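To make the tilting condition (7) concrete, here is a minimal numerical sketch that solves for the tilting parameter by bisection, using the fact that the mean of the tilted law is nondecreasing in the parameter. The Poisson(1/2) input, the truncation at 60 and all function names are assumptions chosen for illustration; for a Poisson variable with rate r < 1 the solution is 1/r and the tilted law is Poisson(1), so the tilted variance is 1.

```python
import math

def tilted_moments(pmf, theta):
    """Normalization, mean and second moment of the law proportional to pmf[j] * theta**j."""
    weights = [p * theta**j for j, p in enumerate(pmf)]
    z = sum(weights)
    mean = sum(j * w for j, w in enumerate(weights)) / z
    second = sum(j * j * w for j, w in enumerate(weights)) / z
    return z, mean, second

def solve_theta(pmf, hi=64.0, tol=1e-12):
    """Bisection for the theta in ]1, hi[ at which the tilted mean equals 1."""
    lo = 1.0
    if tilted_moments(pmf, lo)[1] >= 1 or tilted_moments(pmf, hi)[1] <= 1:
        raise ValueError("no solution bracketed in ]1, hi[")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if tilted_moments(pmf, mid)[1] < 1:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Illustrative input (an assumption): Poisson(1/2) clone-children, truncated at 60.
r = 0.5
pmf = [math.exp(-r) * r**j / math.factorial(j) for j in range(61)]
theta = solve_theta(pmf)
_, mean, second = tilted_moments(pmf, theta)
sigma2 = second - 1  # variance of the tilted law, whose mean equals 1
```

Bisection is used here only because it needs no derivative; any one-dimensional root finder would do.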

Proposition 1. Suppose that $\xi^{(c)}$ and $\xi^{(m)}$ are independent, that
neither distribution is supported by a strict subgroup of $\mathbb{Z}$ and that (7) holds.
Fix $b > 0$ and let $n, k \to \infty$ according to the regime $k \sim b\sqrt{n}$. Then the
conditional law of $\Gamma$ given that the size of the total population is $T_1^{(+)} = n$
and the number of alleles is $A_1 = k$ converges weakly on the space of
mass-partitions to the sequence $(a_1, a_2, \ldots)$ of the atoms of a Poisson random
measure on $]0,\infty[$ with intensity

$$\frac{b}{\sqrt{2\pi\sigma^2}}\,a^{-3/2}\,da, \qquad a > 0,$$

ranked in decreasing order and conditioned by $\sum_{i=1}^{\infty} a_i = 1$.
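The unconditioned Poisson atoms in the limit can be sampled directly, which may help in visualizing the statement. The sketch below uses the standard inverse-tail construction: if G_1 < G_2 < ... are the arrival times of a unit-rate Poisson process and the tail mass of the intensity is 2c/sqrt(a), with c the constant in front of a^(-3/2), then (2c/G_i)^2 are the atoms in decreasing order. It does not implement the conditioning on the sum being 1, which is a singular event and requires the constructions cited in the text; the function name and parameter values are hypothetical.

```python
import math
import random

def ranked_atoms(b, sigma2, count, rng=random):
    """First `count` atoms, in decreasing order, of a Poisson point measure on
    ]0, oo[ with intensity c * a**(-3/2) da, where c = b / sqrt(2*pi*sigma2)."""
    c = b / math.sqrt(2 * math.pi * sigma2)
    atoms, arrival = [], 0.0
    for _ in range(count):
        arrival += rng.expovariate(1.0)  # arrival times of a unit-rate Poisson process
        # Tail mass of the intensity on ]a, oo[ is 2c / sqrt(a); invert it at `arrival`.
        atoms.append((2 * c / arrival) ** 2)
    return atoms

atoms = ranked_atoms(b=1.0, sigma2=1.0, count=20, rng=random.Random(1))
```

Only the largest atoms are produced; the infinitely many remaining atoms are small and are simply not generated here.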

Remark. The special case when $\xi^{(c)}$ and $\xi^{(m)}$ are two independent Poisson
variables, say with rates $r^{(c)}$ and $r^{(m)}$, can also be viewed as an instance
of the situation where mutations affect children independently with probability
$p = r^{(m)}/(r^{(c)} + r^{(m)})$ (cf. the example discussed before Theorem 1).
More precisely, the reproduction law of the standard Galton-Watson process
is then Poisson with rate $r^{(c)} + r^{(m)}$. This special case has some importance,
as it is well known that conditioning a Galton-Watson tree with Poisson(1)
reproduction law to have size $n$ and then assigning to each individual a
distinct label in $\{1,\ldots,n\}$ by uniform sampling without replacement yields
the uniform distribution on the set of rooted trees with $n$ labeled vertices.
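The thinning identity behind this remark can be checked numerically: if the total number of children is Poisson and each child mutates independently with probability p, then the joint law of (clones, mutants) is a product of two Poisson laws. The following sketch compares the two joint probability mass functions on a small grid; the rates 0.7 and 0.3 and the function names are hypothetical choices made for illustration.

```python
import math

def poisson_pmf(rate, j):
    """Probability that a Poisson(rate) variable equals j."""
    return math.exp(-rate) * rate**j / math.factorial(j)

def thinned_joint_pmf(rate, p, clones, mutants):
    """P(clones, mutants) when the total is Poisson(rate) and each child
    is a mutant with probability p, independently of the others."""
    total = clones + mutants
    return (poisson_pmf(rate, total) * math.comb(total, mutants)
            * p**mutants * (1 - p)**clones)

r_c, r_m = 0.7, 0.3          # hypothetical clone and mutant rates
p = r_m / (r_c + r_m)        # mutation probability per child
max_gap = max(
    abs(thinned_joint_pmf(r_c + r_m, p, a, b) - poisson_pmf(r_c, a) * poisson_pmf(r_m, b))
    for a in range(10) for b in range(10)
)
```

The quantity `max_gap` should vanish up to floating-point rounding.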

Proof. Let $\tilde{P}$ denote the probability measure which is obtained from $P$
by exponential tilting, and more precisely, in such a way that the variables
$\xi_1^{(c)}, \xi_2^{(c)}, \ldots$ are i.i.d. under $\tilde{P}$ with law given by

$$\tilde{P}(\xi^{(c)} = j) = \frac{\theta^{j}}{z_\theta}\,P(\xi^{(c)} = j), \qquad j \in \mathbb{Z}_+,$$

where $z_\theta$ is the normalization factor, namely,

$$z_\theta = E(\theta^{\xi^{(c)}}).$$

As in the proof of Corollary 2, we see from an application of the ballot
theorem that the conditional distribution of $(n\Gamma_1,\ldots,n\Gamma_{A_1})$ given
$T_1^{(+)} = n$ and $A_1 = k$ is the same as that obtained from the i.i.d. sequence
$|C_1|,\ldots,|C_k|$ by ranking in decreasing order and conditioning on
$\sum_{i=1}^{k}|C_i| = n$ and $\sum_{i=1}^{k} M_i = k-1$. Observe that the latter is
equivalent to conditioning on $T_k^{(c)} = n$ and $\sum_{i=1}^{n}\xi_i^{(m)} = k-1$.
Further, recall from Lemma 3 that $|C_j| = T_j^{(c)} - T_{j-1}^{(c)}$ and hence, on this
event, the variables $|C_1|,\ldots,|C_k|$ are functions of $\xi_1^{(c)},\ldots,\xi_n^{(c)}$.
Thus the assumption of independence between $\xi^{(c)}$ and $\xi^{(m)}$ enables us to
ignore the conditioning on $\sum_{i=1}^{n}\xi_i^{(m)} = k-1$. Finally, the exponential
tilting does not affect such a conditional law: on the event $\{T_k^{(c)} = n\}$ one
has $\xi_1^{(c)} + \cdots + \xi_n^{(c)} = n-k$, so the density $\theta^{n-k}/z_\theta^{n}$ of the
law of $(\xi_1^{(c)},\ldots,\xi_n^{(c)})$ under $\tilde{P}$ with respect to $P$ is constant
there. Hence the sequence $|C_1|,\ldots,|C_k|$ has the same distribution under
$\tilde{P}(\cdot \mid T_k^{(c)} = n)$ as under $P(\cdot \mid T_k^{(c)} = n)$.

We then estimate the distribution of the size of the Eve cluster under $\tilde{P}$,
which is given again according to the Dwass formula [19] by

$$\tilde{P}(|C_1| = m) = \frac{1}{m}\,\tilde{P}(\xi_1^{(c)} + \cdots + \xi_m^{(c)} = m-1) = \frac{1}{m}\,\tilde{P}(S_m^{(c)} = -1).$$

Recall that, by assumption, $\xi^{(c)}$ is critical with variance $\sigma^2$ under $\tilde{P}$,
so an application of Gnedenko's local central limit theorem gives

$$\tilde{P}(|C_1| = m) \sim \frac{1}{\sqrt{2\pi\sigma^2 m^3}} \qquad \text{as } m \to \infty.$$

Putting the pieces together, we get that the conditional distribution of
$(n\Gamma_1,\ldots,n\Gamma_{A_1})$ given $T_1^{(+)} = n$ and $A_1 = k$ is the same as that obtained
from an i.i.d. sequence $Y_1,\ldots,Y_k$ by ranking in decreasing order and
conditioning on $\sum_{i=1}^{k} Y_i = n$, where

$$P(Y_1 = m) \sim \frac{1}{\sqrt{2\pi\sigma^2 m^3}} \qquad \text{as } m \to \infty.$$

An application of Corollary 2.2 in [13] completes the proof of our claim. □
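The two estimates used in this proof, the Dwass formula and the local central limit theorem, can be compared numerically in the simplest critical case. The sketch below assumes that the clone-children are Poisson(1) under the tilted law (so the variance is 1 and a sum of m steps is Poisson(m)); this choice and the function names are illustrative assumptions, not part of the argument above.

```python
import math

def eve_cluster_pmf_poisson1(m):
    """Dwass formula with Poisson(1) clone-children:
    P(|C_1| = m) = (1/m) * P(Poisson(m) = m - 1)."""
    # log of exp(-m) * m**(m-1) / (m-1)!  minus log(m); lgamma(m) = log((m-1)!)
    log_p = -m + (m - 1) * math.log(m) - math.lgamma(m) - math.log(m)
    return math.exp(log_p)

def local_clt_equivalent(m, sigma2=1.0):
    """Asymptotic equivalent 1 / sqrt(2 * pi * sigma2 * m**3)."""
    return 1.0 / math.sqrt(2 * math.pi * sigma2 * m**3)

# Ratio of the exact value to its asymptotic equivalent for increasing m.
ratios = {m: eve_cluster_pmf_poisson1(m) / local_clt_equivalent(m)
          for m in (10, 100, 1000)}
```

The ratios approach 1 as m grows, in line with the stated equivalence.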

4. Lévy forests with mutations. The purpose of this section is to point
at an interpretation of a standard limit theorem involving left-continuous
(i.e., downward skip-free) random walks and Lévy processes with no negative
jumps, in terms of Galton-Watson and Lévy forests in the presence of neutral
mutations. We first introduce some notation and hypotheses in this area,
referring to the monograph by Duquesne and Le Gall [17] for details.

For every integer $n \geq 1$, let $(\xi^{(c)}(n), \xi^{(m)}(n))$ be a pair of
integer-valued random variables with

$$E(\xi^{(c)}(n) + \xi^{(m)}(n)) = 1.$$

We consider two left-continuous random walks

$$S^{(+)}(n) = (S_i^{(+)}(n) : i \in \mathbb{Z}_+) \quad\text{and}\quad S^{(c)}(n) = (S_i^{(c)}(n) : i \in \mathbb{Z}_+),$$

whose steps are (jointly) distributed as $\xi^{(+)}(n) - 1$, where
$\xi^{(+)}(n) := \xi^{(c)}(n) + \xi^{(m)}(n)$, and $\xi^{(c)}(n) - 1$, respectively. Let also
$X = (X_t, t \in \mathbb{R}_+)$ denote a Lévy process with no negative jumps and Laplace
exponent $\psi$, namely,

$$E(\exp(-\lambda X_t)) = \exp(t\psi(\lambda)) \qquad \text{for every } \lambda, t \geq 0.$$

We further suppose that $X$ does not drift to $+\infty$, which is equivalent to
$\psi'(0+) \geq 0$, and that

$$\int_1^{\infty} \frac{d\lambda}{\psi(\lambda)} < \infty.$$

We also need to introduce a different procedure for encoding forests by
paths, which is more convenient to work with when discussing continuous
limits of discrete structures. For each $n \geq 1$, we write
$H(n) = (H_i(n), i \in \mathbb{N})$ for the (discrete) height function of the
Galton-Watson forest $(T_\ell, \ell \in \mathbb{N})$. That is, for $i \geq 0$, $H_i(n)$ denotes
the generation of the $(i+1)$th individual found by the usual depth-first search
(i.e., mutations are discarded) on the Galton-Watson forest. In the continuous
setting, trees and forests can be defined for a fairly general class of Lévy
processes with no negative jumps, and in turn are encoded by (continuous)
height functions; cf. Chapter 1 in [17] for precise definitions and further
references.
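The discrete height function described above can be computed from the offspring numbers listed in depth-first order. The sketch below is one possible implementation under two assumptions that are not spelled out in the text: the root of each tree sits at generation 0, and the input lists the number of children of each individual in the order in which the depth-first search visits them. The function name is hypothetical.

```python
def height_function(offspring):
    """Generations of the individuals of a forest, listed in depth-first order.

    offspring[i] is the number of children of the (i+1)th individual visited.
    """
    heights = []
    pending = []  # pending[d] = children still to be visited of the current ancestor at depth d
    for children in offspring:
        # Ancestors with no child left to visit are finished; climb back up.
        while pending and pending[-1] == 0:
            pending.pop()
        if pending:
            pending[-1] -= 1          # this individual is one of the pending children
        heights.append(len(pending))  # an empty stack means a new root, at generation 0
        pending.append(children)
    return heights

# A root with two children, the second of which has one child,
# followed by a second tree made of a root and a single child.
example = height_function([2, 0, 1, 0, 1, 0])
```

The stack plays the role of the ancestral line of the individual being visited, so its length is the current generation.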

The key hypothesis in this setting is the existence of a nondecreasing
sequence of positive integers $(\gamma_n, n \in \mathbb{N})$ converging to $\infty$ and such that

$$\lim_{n\to\infty} n^{-1} S^{(+)}_{n\gamma_n}(n) = X_1 \quad \text{in law;} \tag{8}$$

we also assume that the technical condition (2.27) in [17] is fulfilled. Then
the rescaled height function

$$(\gamma_n^{-1} H_{[tn\gamma_n]}(n) : t \geq 0)$$

converges in distribution, in the sense of weak convergence on Skorohod
space $D(\mathbb{R}_+, \mathbb{R}_+)$ as $n \to \infty$, toward the height process $(H_t : t \geq 0)$
which is constructed from the Lévy process $X = (X_t, t \geq 0)$; see Theorem 2.3.1
in [17].

Similarly, we write $H^{(c)}(n) = (H_i^{(c)}(n), i \in \mathbb{N})$ for the height function
of the Galton-Watson forest $(C_j, j \in \mathbb{N})$, where each allelic cluster $C_j$ is
endowed with the genealogical tree structure induced by the population model (see
Remark, item 1 in Section 2.3).

Proposition 2. Suppose that the preceding assumptions hold, and also
that

$$\lim_{n\to\infty} \gamma_n E(\xi^{(m)}(n)) = d \quad\text{and}\quad \lim_{n\to\infty} n^{-1}\gamma_n \mathrm{Var}(\xi^{(m)}(n)) = 0 \tag{9}$$

for some $d \geq 0$. Then the rescaled height function

$$(\gamma_n^{-1} H^{(c)}_{[tn\gamma_n]}(n) : t \geq 0)$$

converges in distribution, in the sense of weak convergence on Skorohod space
$D(\mathbb{R}_+, \mathbb{R}_+)$ as $n \to \infty$, toward the height process

$$(H_t^{(d)} : t \geq 0),$$

which is constructed from the Lévy process

$$X^{(d)} = (X_t^{(d)} := X_t - dt, \ t \geq 0).$$

Remark. More recently, Duquesne and Le Gall [18] (see also the survey [29])
have developed the framework in which Lévy trees are viewed as
random variables with values in the space of real trees, endowed with the
Gromov-Hausdorff distance. Proposition 2 can also be restated in this setting.

Proof of Proposition 2. The assumption (8) ensures the convergence in
distribution

$$(n^{-1} S^{(+)}_{[tn\gamma_n]}(n) : t \geq 0) \Longrightarrow (X_t : t \geq 0);$$

see Theorem 2.1.1 in [17] and (2.3) there. On the other hand, by a routine
argument based on martingales, the assumption (9) entails that

$$\lim_{n\to\infty} n^{-1}\bigl(S^{(+)}_{[tn\gamma_n]}(n) - S^{(c)}_{[tn\gamma_n]}(n)\bigr) = dt,$$

uniformly for $t$ in compact intervals, in $L^2(P)$. The convergence in
distribution

$$(n^{-1} S^{(c)}_{[tn\gamma_n]}(n) : t \geq 0) \Longrightarrow (X_t - dt : t \geq 0)$$

follows. Recall that depth-first search with mutations on the initial forest
yields the usual depth-first search for the forest of allelic clusters (cf. Remark,
item 1 in Section 2.3). We can then complete the proof as in Theorem 2.3.1
in [17]. □

We now conclude this work by discussing a natural example. Specifically,
we suppose that the distribution of

$$\xi^{(c)}(n) + \xi^{(m)}(n) = \xi^{(+)}(n) := \xi$$

is the same for all $n$. For the sake of simplicity, we assume also that
$E(\xi) = 1$ and $\mathrm{Var}(\xi) = 1$. We may then take $\gamma_n = n$, so by the central
limit theorem, (8) holds and the Lévy process $X$ is a standard Brownian motion.
We fix an arbitrary $d > 0$ and consider the independent pruning model
where for each integer $n > d$, conditionally on the total number of children
$\xi^{(+)}(n) := \xi^{(c)}(n) + \xi^{(m)}(n) = k$, the number $\xi^{(m)}(n)$ of mutant-children
of a typical individual has the binomial distribution $B(k, d/n)$. In other words,
in the $n$th population model, mutations affect each child with probability
$d/n$, independently of the other children. Then (9) clearly holds. Roughly
speaking, Theorem 2.3.1 of [17] implies in this setting that the initial
Galton-Watson forest associated with the $n$th population model converges in law,
after a suitable renormalization, to the Brownian forest, whereas Proposition 2
of the present work shows that the allelic forest, renormalized in the
same way, converges in law to the forest generated by a Brownian motion
with drift $-d$.
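The claim that (9) holds in the independent pruning model can be checked by a short computation of moments. The sketch below assumes, purely for illustration, that the total number of children is Poisson(1) (which has mean 1 and variance 1), truncated at 60, and takes d = 2; the function name is hypothetical. The variance of the number of mutants is obtained from the law of total variance.

```python
import math

def mutant_moments(total_pmf, p):
    """Mean and variance of the number of mutant-children when each child
    of a parent with offspring law total_pmf mutates with probability p."""
    mean_total = sum(k * q for k, q in enumerate(total_pmf))
    var_total = sum(k * k * q for k, q in enumerate(total_pmf)) - mean_total**2
    mean = p * mean_total
    # Law of total variance: E(Var(. | total)) + Var(E(. | total)).
    var = p * (1 - p) * mean_total + p * p * var_total
    return mean, var

# Illustrative choice (an assumption): total offspring Poisson(1), truncated at 60.
total_pmf = [math.exp(-1) / math.factorial(k) for k in range(61)]
d = 2.0
checks = {}
for n in (10, 100, 1000):
    mean, var = mutant_moments(total_pmf, d / n)
    gamma_n = n  # the normalizing sequence of the example
    checks[n] = (gamma_n * mean, gamma_n * var / n)
```

In this setting the first entry of each pair stays close to d while the second is close to d/n and hence tends to 0, which is what (9) requires.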






The preceding example provides an explanation for the rather intriguing
relation which identifies two seemingly different fragmentation processes: the
fragmentation process constructed by Aldous and Pitman [6] by logging the
Continuum Random Tree according to a Poisson point process on its skeleton, and
the fragmentation process constructed in [12] by splitting the unit interval at
instants when a Brownian excursion with negative drift reaches a new infimum.
It is interesting to mention that Schweinsberg [38] already pointed at
several applications of the (continuous) ballot theorem in this framework.
More generally, the transformation $X \to X^{(d)}$ of Lévy processes with no
negative jumps also appeared in an article by Miermont [32] on certain eternal
additive coalescents, whereas Aldous and Pitman [7] showed that the latter
arise asymptotically from independent pruning of certain sequences of
birthday trees. Finally, we also refer to [2] for another interesting recent work
on pruning Lévy random trees.

Acknowledgment. I would like to thank two anonymous referees for their